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Abstract 

Functional linear regression is a useful extension of simple linear regression and has been 
investigated by many researchers. However, functional variable selection problems when 
multiple functional observations exist, which is the counterpart in the functional context 
of multiple linear regression, is seldom studied. Here we propose a method using group 
smoothly clipped absolute deviation penalty (gSCAD) which can perform regression esti- 
mation and variable selection simultaneously. We show the method can identify the true 
model consistently and discuss construction of pointwise confidence interval for the esti- 
mated functional coefficients. Our methodology and theory is verified by simulation studies 
as well as an application to spectrometries data. 

Keywords: Estimation consistency, functional linear regression, group SCAD, principal 
component analysis, selection consistency. 

1 Introduction 



In several applications, functional data appear as the basic unit of observations. Classical 
regression models may be inadequate for such cases because of the high correlations for 
the discretized data. Compared with the discrete multivariate analysis, functional analysis 
takes into account the smoothness of the high dimensional covariates, and often suggests 
new approaches to the problems that have not been di s covered before . Som e recent develop- 
ments i n functional regression include lYao et all (|200fil ): ICai and ICrambes et al 
(120091 ) ; lYuan and CaT(|2010+h . 



The literature contains an impressive range of functional analysis tools fo r various prob- 

lems. The more traditional approach, carefully documented in the monograph lRamsav and Silverman 

ypically starts by representing functional data by an expansion with respect to a 
certain basis, and subsequent inferences are carried out on the coefficients. Another line 
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of work by the French school, taking a nonparametric point of view, extends the tradi- 
tional nonparamet r ic tec hniques, most notably the kernel estimate, to the functional case 



(jFerratv and Vieul . I2006T ). Besides these two major approaches, other methods such as 



putting fu nctiona l regression in the reproducing kernel Hilbert space framework has been 
developed (IPredal . 120071 : iLianl . I2007I ). 

In this paper, we are concerned with an extension of simple functional linear regression 
model to the case where multiple functional observations are made on each unit. Formally, 
the model we consider is 



Yi 



p pi 
J 



i < n, 



where Xij are random functions, a is the intercept, are random scalar errors and the 
functional coefficients /3j, 1 < j < p are the objects of interest in the model. 

Because functional coefficients are more complicated objects than scalar coefficients 
in classical multiple linear regression, it is generally desirable to identify those significant 
variables in predic ting the responses, even if p is small. For example, in a similar context, 



Zhu et al.l (|2010+l ) investigated fluorescence spectroscopy for cervical precancer diagnosis, 
where a Bayesian model is used to select from multiple fluorescence spectra for classification 
of subjects. 

In a non-Bayesian context, traditional methods for variable selection in classical linear 
models include constructing hypothesis tests or using information criteria. More recently, 
regularization me thods have received much attention. For standard linear regression, Lasso 
(|Tibshiranil . ll99fih is probably the most popular method that u ses Li penalty to force some of 



the c o efficients to be equa l to zero. Several subsequent works (jMeinshausen and Buhlmann 



20061 : IZhao and Yul l200d ) have shown that Lasso is in general not consistent for model 
selection unless some nontrivial conditio ns on the covariates are satisfied. To address these 
shortcomings of Lasso . iFan and proposed the smoothly clipped absolute deviation 

(SCAD) penalty and Zou (2006) proposed the adaptive lasso in the fixed p case using a 
weighted L\ penalty with weights determined by an initial estimator, and there are many 
other extensions of the regularization fram ework for variable selection ( Yuan and Lira . 12009 : 
Wang and Len~l l2007l : iHuang et all . l2008l ) . 

In th is article we use functional principle compone nt analysis (PCA)-based estimation 



method ( Cardot et al. . 19991 : Hall and Horowitz 



terminology seems to first appear in I Wang et al 



2007) combined with group SCAD (this 



20071 ) for varying coefficients variable se- 



lection) which represents a new application of the SCAD penalty. Regularization method for 
variable selection in nonparametric settings has been de yeloped in the context of smoothing 
spline ANOVA fo r nonparamet ric regression smoother ( Lin and Zhang . 20061 ) and support 
vector machines (Zhang . 20061 ). For varying-coefficient models, we are aware of the work 



(jWang et al.l . |2008| ) where the authors used basis expansion a pproach for est i matio n com- 



bined with group SCAD penalty on coefficients, and the work (|Wang and Xil l200fll ) where 
group Lasso penalty is applied directly to smooth functions evaluated at sampled points. 
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The rest of the article is organized as follows. We describe the functional PCA and 
shrinkage estimation procedure in Section 2.1, and present estimation consistency and selec- 
tion consistency results in Section 2.2. Then we discuss estimation and inference algorithms 
and tuning parameter selection in Section 2.3 and 2.4 respectively. In Section 3, we present 
some simulation experiments and illustrate the proposed method using the spectrometries 
data example where the goal is to predict the percentage of fat content in the piece of meat 
based on spectra curves. All technical details are gathered in the Appendix. 

2 Methodology and Theoretical Properties 
2.1 Estimation of Multiple Functional Regression 

Assume we have independent and identically distributed (i.i.d.) observations (An , . . . , X ip , Yj), 
1 < i < n, where X^ is a square integrable random function on the interval [0, 1] with mean 
The response variables Yi are generated by model (pQ) with i.i.d. errors of finite 
second moments. We also assume the errors are independent of the predictors. We will 
use (Ai, . . . , X p ,Y) to denote the generic random variables with distribution the same as 
(Aji, . . . ,Xi p ,Yi). Let Kj(s,t) = Cov{Xj(s),Xj(t)}, and by Mercer's theorem we have the 
spectral expansion 

oo 
k=l 

where Aji > Xj2 > • • • > are the eigenvalues of the linear operator associated with Kj(s,t) 
with corresponding eigenfunctions 4>jk- On the other hand, let Kj(s,t) = - Yli=i(^ij( s ) ~ 
Xj{s)){Xij{t) — Xj(t)) where Xj = Yli-^-ij /n, we have the empirical counterpart of the 
above expansion, 

oo 

K j( s >t) = ^jk<j>jk(s)(j>jk(t), 
fe=i 

where A^i > Xj2 > ■ ■ ■ > 0. To get rid of uncertainty of signs, we assume J (ftjiftj > 0. For 
the empirical operator K, at most n eigenvalues are strictly positive. 

In general, different functional predictors are not independent of each other. The 
Karhunen-Loeve expansion of the random function X^ in terms of the orthonormal ba- 
sis 4>jk(t) is given by 

oo 

Xij - fij = /^jjk^jk, ( 2 ) 
fc=i 

where are principal component scores satisfying E^j^ = 0, E& k = \jk and E^ijkdjk' = 
0, k 7^ k' . Thus from ([2|) we have the covariance operator expansion 

oo 

Cov{X n (s),X j2 (t)}= K^hkxWhkAt), 

ki,k2=l 
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where X^f? = E£j 1 k 1 (,j 2 k 2 determines the dependency structure between different predic- 
tors. Note with our notation, when j\ = j% = j, X^ J fc = if ki ^ k% and A^ fcz = Xjk 
if h\ = h,2 = k. An illustration of how this dependency could arise is given in the next 
subsection. 

The model ([TJ can be equivalently written as 

P r 

Y i-» = J2J Pi ( x v + (3) 

3=1 

where (j, = E[Y\Xi, . . . ,X p ] = a + Y^j f PjfJ>j- After (3j is estimated by fij, say, the intercept 
a can be easily estimated by a = Y — Y2j J PjXj, where Y = ^ Yi/n. 

Now we consider the problem of estimating j3j. Using the orthonormal basis {4>jk}, © 
can be equivalently written as 

p oo 

Yi — 11 = Cijkbjk + Ej, 

3=1 k=l 

making use of the expansion (3j = Yk bjk4>jk- This suggests the estimator 

n p K 

{bjk} = argmin^(Yi - Y - Cijkbjk) 2 , 

i=l j=l k=l 

and then fij = Ylk=i bjk4>jk-, where in the above displayed equation f;^-*. = J (Xij — Xj)^^ 
is the principal component score estimated from data. Here the truncation point K is a 
smoothing parameter. To further select functional predictors simultaneously, we minimize 
the criterion function 

n p K p 

J(b) = J2(Yi - Y ^>>. ; /,/;,/,r +nX)p A (||6 i ||) ) (4) 

i=l j=l k=l j=l 

where is the I2 norm of bj = (bji, . . . , bjx) T '• Amon g many ways to spe cify the penalty 
function p\, we choose the SCAD penalty function of Fan and Lil (|200lh . which can be 
defined by 

p' x (6) = A 1.1(9 < A) + { "*~^ I(0 > A)} , p A (0) = 0, 

for a = 3.7 and 9 > 0, where /(•) is the indicator function. The choice of a = 3.7 is 
suggested by iFan and Lil (|200ll ) and adopted in almost all publications involving SCAD 
penalty. Other penalty functions such as the adaptive Lasso can also be used here and will 
lead to similar consistency results as below. 
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2.2 Consistency Properties 



Large sample pr operties of shrinkage estimation with SCAD penalty have b een established 
in the literature ( Fan and El bOQlUFan and Penj . 120041 ; IWang etaD . I2OO8I ) . We show that 
in our context the estimation procedure can consistently estimate the functional coefficients 
as well as consistently identify the true model. However, extending these theoretical results 
to multiple functional regression is not trivial. Note that in criterion dH) two types of 
approximations are involved, one is the truncation of f3j to approximate the functional 
coefficients, the other is the unknown covariate £ijk estim ated by {.jjk - Whi le the former 
approximation is typical in nonparametric problems such as Wang et al. ( 20081 ) . the latter is 
unique to the functional regression problem. It also rese mbles the measurement error mode l 
in fo rm where the covariates are not observed directly ( Liang and Li . 20091 ; Carroll et al 
2009h 



We denote the true regression coefficients by f3 = ((/3^) T , (f3^) T ) T with = (fii, . . . ,/3 s 
p containing all nonvanishing components of f3 and /3 s +i = • • • = (3 P = 0. Let A be the 
pK x pK matrix 

/ A 1 ' 1 • • • A x ' p 



(5) 



\ A?- 1 



Ap>p 



(c4) The parameter A satisfies A = o(K a ) and \J K 3a+3 /n = o(A). 

(c5) The minimum eigenvalue of A, denoted by /O m i n (A), is of order £l(K~ a ) where a r , 
Q(b n ) means b n = 0(a n ). 



where A J1J2 is the K x K matrix with entries A^'^,1 < k±,k2 < K. In our results, the 
following regularity conditions are needed: 

(cl) Xij has finite fourth moment: J E{Xf-) < 00, and Eej < 00. 
(c2) A fc - A fc+ i > C- x k- a - x , b jk < Ck-P, a > 1,(3 > a + 1/2. 

(c3) The smoothing parameter K satisfies ra - 1 /( 5a + 3 ) x — > 0, and n _1 /( 3a + 2 / 3 + 2 ) j{ [ s bounded 
away from 0. 



Remark 1 Because f3 > a+1/2, there exists K satisfying (c3). Also, we have W K 3a+3 /n 
o{K~ a ) when K satisfies (c3) and thus there exists A satisfying (c4)- 

Based on the assumptions listed above, we can establish the following result. 



Theorem 1 Assume (cl)-(c5), we have 

(a) (Estimation consistency) — f3j\ 

(b) (Selection consistency) fi s +i = ■■■'- 



= o p (l),l <j <p. 

(3 p = with probability converging to 1. 
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Note that the study of optimal convergence rates for multiple functional regression problem 
is more complicated and we do not attempt it here. 

An illustration. Let p = 2. Suppose the eigenvalues of K\ and satisfy Xjk = 
Ck~ a ,j = 1,2. If X\ and X2 are independent, then the matrix A defined in ([5]) is diagonal 
and its minimum eigenvalue is obviously of order £l{K~ a ). In general, A can be written as 
a block matrix 

E F 
F T G 



A 



where E and G are K x K diagonal matrices containing the eigenvalues of K\ and K2 
respectively. It is easy to see that the minimum eigenvalue of A is no bigger than CK~ a , 
since A is similar to 

E 

G-F T E- 1 F 



A 



(that is A = P AP, for some invertible matrix P), and obviously the eigenvalues of 
G - F T E- 1 F are dominated by those of G. In assumption (c5), we assume that the 
minimum eigenvalue of A is still of order K~ a as in the independent case. This assumption 
thus can be thought of as a constraint on the dependence of different predictors. However, 
we show in the following setup this assumption is quite natural. Suppose the random 
functions X\ and X2 are specified by 



(6) 



i=i 



where Wj,l < j <l are independent mean zero random functions with Karhunen-Loeve 
expansion give by Wj = L0jk4>k (note that we assume the eigenfunctions are common to 
all Wj) with Eujj k = > 0. The following proposition gives a sufficient condition under 
which p min {A) = n(K~ a ). 



Proposition 1 Suppose ck~ a < Kj k < Ck~ a ,j = 1, ... ,1, for some constants C > c > 0. 
If {aij}, {ci2j} are two fixed sequences and one is not a scalar multiple of the other (that 
is, for any constant 7, we cannot have a±j = 702^ or a2j = 701 j for all j), then p m - m (A) = 

n(K~ a ). 



2.3 Computation and Inferences 

One can express the criterion function J(b) in vector and matrix form. Let 

/ • • • iijK \ 



\ Cnjl ■ ■ ■ £,njK / 
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Z = (Zx, . . .,Z p ), b = (&u, . . .,b 1K ,b 2 i, . . .,b pK ) , Y = • • • , Yn) , the criterion (@J is 
written as 



v 



Jib) = £(Y -Yl- Zb) T (Y -Yl- Zb) +nY J Px{\\b j \\), W 

i=i j=i 

where 1 is the n-dimensional vector with all compo nents ones. 



We use the local quadratic approximation idea (|Fan and Lil . |2001| ) to optimize the cri- 
terion. Specifically, if b^ is the estimate obtained in the m-th iteration, then the criterion 
([7]) can be locally approximated by 

n 

£(Y -Yl- Zb) T (Y -Yl-Zb) + ^Y, bjR(b {m %, (8) 
i=l j 

where = diagW^^^/^^lK, . . . , ip'M^ \\) \\Vk} and I K is the 

K x K identity matrix. The minimizer of (|8|) is then given by 

g(m+i) = ^ T Z + nR{U m) ))- 1 Z T {Y - Yl). 

We iterate these steps until convergence and obtain the final estimate b. During the it- 
erations, depending on the choice of A, if some \\bj\\ is smaller than a threshold (10 -5 in 
our implementation), then we set bj = and ignore the corresponding predictor in future 
iterations. 

Now we consi der the construction of pointwise confidence intervals for /3j. Following 
Fan and Lil ( 2001 ). the sandwich formula can be used as an estimator for the variance of 



the nonzero components of b, denoted by b^ henceforth. The estimator of asymptotic 
variance is given by 

C^(SW) = ((zWfzW +nR^)-\zW) T C^v{Y)Z^{(Z^) T Z^ +nR^)-\ 

where Z^ denotes the selected columns of Z corresponding to nonvanishing 
denotes the selected rows and columns of R(b) in a similar way, and Cov(Y) is the n x n 
diagonal matrix with estimated squared residuals on the diagonal. The diagonal blocks of 
gives the asymptotic variance for nonvanishing bj. 
Since $j(t) = bj4>j(t), 4>j(t) = (4>ji(t), . . . , 4>jK{t)) T ' ■ We have the natural estimator for 
the asymptotic variance of (5j{t): 

Covifijit)) = faffiCoufatfjQ). 

Note that here we ignored the uncertainty of <f>j which is also estimated from observations. 
However, we think this might be a reasonable first approximation because one might argue 
that 4>jk is easier to estimate than f3j since needs to be estimated first and its accuracy 
affect the subsequent estimations. Of course this is just a heuristic argument and we later 
use our simulation experiments to illustrate the performance the asymptotic variance for- 
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mula. Estimates of the asymptotic variance can be used to construct pointwise confidence 
intervals for (3j(t) for nonzero components of the functional coefficients. Strictly speaking, 
the constructed intervals will only be for the truncated (3j(t) at cutoff K in the expansion. 
Thus the constructed intervals will have lower than targeted coverage rate: on the one 
hand the variability in 0^ is ignored, on the other hand the interval is only for truncated 
functional coefficients. The bias caused will be seen from our numerical results. 



2.4 Tuning Parameter Selection 

For implementation of our method, we need to choose two smoothing parameters, the 
truncation point K and the regularization parameter A for group SCAD penalty. In fact, for 
different predictors we might choose a different truncation point, which nevertheless would 
lead to a significant increase in computational burden, and thus we constrain the truncation 
point to be the same for all predictors. Besides, one can argue that the sensitivity of the 
estimator to the choice of K is reduced by the extra smoothing parameter A in regularized 
estimation. 

We use generalized cross-validation (GCV) to select both K and A. GCV can be thought 
of as a short-cut for leave-one-out cross-validation, and also comes with advantageous prop- 
erties dWahbal . Iliffloh . The GCV criterion is defined by 



1 IIY-Y1-YII 2 

GCV(K,X) = U L 

V ; n(l-tr(H(K,X))/n) 2 

where Y = H(K,X)(Y - Yl) is the fitted response values and H(K,X) = Z(Z T Z + 
Z T is the hat matrix. 



3 Numerical Experiments 
3.1 Simulation Examples 

We perform a Monte Carlo experiment to investigate the finite sample performance of the 
estimation method, using GCV to select the two tuning parameters. Our example follow 
the illustration presented after Theorem [TJ The simulated data is generated from model 
([T]) with p = 4 functional predictors, a = and the errors e distributed as N(0,a 2 ). For 
1 < j < 4 independently, we take Wj = Y^kLi^jk^k where ~ N(0,k~ 2 ), cf)\ = 1 and 
4>k+i = V2cos(k-7rt) for k > 1. Then the functional predictors are defined through the linear 
transformations 

Xi = W 1 + p(W 2 + W 3 ), 
X 2 = W 2 + p{W 1 + W 3 ), 
x 3 = W 3 + p(W 1 + W 2 ), 
x 4 = W A . 
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Table 1: Simulation results for penalized multiple functional regression. 



Scenario 



MSE OMSE TP FP 95% Cov.Prob.l 95% Cov.Prob.2 



p = 0.0, a = 
p = 0.2, a = 
p = 0.5, a = 
p = 0.0, a = 
p = 0.2, a = 
p = 0.5, a = 



0.1 
0.1 
0.1 
0.3 
0.3 
0.3 



0.73 0.64 2 0.08 0.92 0.93 

0.70 0.63 2 0.09 0.92 0.94 

0.97 0.85 2 0.08 0.93 0.94 

1.73 1.47 1.77 0.13 0.93 0.94 

2.06 1.81 1.81 0.11 0.93 0.94 

2.75 2.26 1.80 0.15 0.92 0.92 



Note the scalar p controls the strength of dependence between different predictors, with 
p = resulting in independent predictors. For /3i and /?2, in terms of expansion based on 
{<j> k }, we take b x = (-2, 1, -2, 1) T , b 2 = (1, -1, 0.5, -0.5) T and set /3 3 = /3 4 = 0. We fix 
n = 100 for all our simulations and set p = 0, 0.2 or 0.5, and a 2 = 0.1 or 0.3. All integrations 
required in the generation of the data and the estimation procedure are performed using a 
Riemannian sum approximation with an equally spaced grid containing 500 points on [0, 1]. 

The simulation results are summarized in Table Q] based on 500 runs in each scenario, 
where we report the mean squared errors ||/3 — /3|| 2 using our regularized multiple func- 
tional regression model (MSE), oracle mean squared errors where the true zero coefficients 
are known and no shrinkage is applied (OMSE), average number of correctly identified 
nonvanishing coefficients (TP), average number of incorrectly identified nonvanishing co- 
efficients (FP), empirical coverage probability of pointwise 95% confidence interval for f3\ 
(95% Cov.Prob.l) and empirical coverage probability of pointwise 95% confidence interval 
for 02 (95% Cov.Prob.2). For each scenario, the empirical coverage probabilities reported 
are the averages over the grid (0.1,0.2, . . . ,0.9) for j3\ and f3 2 whenever they are estimated 
as nonzero coefficients. 

As one can see from Table [H the noise level clearly has a significant effect on the estima- 
tion errors as well as the average number of truly relevant predictors detected. However, the 
number of false positives remains at a low level even for larger noise variance. Compared to 
noise level, the correlation between different predictors seems to have milder effects, with 
estimation error increasing with correlation strength. The result also shows that confidence 
intervals constructed based on the sandwich formula for the asymptotic variance work sur- 
prising well, with only a small downward bias in our simulations. As an illustration, the 
true functions /3\, (3 2 as well as their estimates when p = 0.2 and a = 0.1 or 0.3 are plotted 
in Figure [TJ 

3.2 Spectrometries Data 

We illustrate our approach on the real spectrometries dataset, which contains 215 spectra of 
light absorbance for meat samples as functions of the wavelengths. Because of the denseness 
of wavelengths at which the measurements are made, the subjects are naturally treated as 
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(a) (b) 

Figure 1: (a) The true coefficient j3i (solid line) with its estimates when a = 0.1 (dashed 
line) and a = 0.3 (dotted line), (b) /?2 (solid line) with its estimates when a = 0.1 (dashed 
line) and a = 0.3 (dotted line). Here we set p = 0.2. 

continuous curves. Figure [2] shows the first 50 curves in the dataset. This dataset has 
been previously used in functional nonparametric regression studies where the covariate is 
the spectra curve and the response is the percenta ge of fat content in the piece of meat 



tne spectra curve and tne response is tne percenta ge oi rat content m tne piece or meat 
(jFerratv and Vieul . I2002L 120061 : iFerraty et all . 120071 ). In nonparametric kernel regression, 



as shown in the above mentioned works, choice of semi-metric which defined the notion of 
distance between curves is crucial for the performance of the estimator. Previous study 
suggested that for nonparametric regression function estimation, taking as the semi-metric 
the L2 distance between the second derivatives of the spectra gives favorable results based on 
its performance on hold-out validation data. A desirable feature of an estimation procedure 
would be to determine the appropriate order of derivative automatically. 

Here we apply the multiple functional linear regression model to the spectrometries data. 
We treat the original function itself as well as up to its 3rd derivatives as the predictors 
in our model. The idea of using different orders of derivatives of curves as covariates in 
the functional linear model is similar to using transformations of the original covariates in 
classical multiple linear regression, which makes the linear model more flexible. Compared 
to nonparametric functional kernel regression, the functional linear model is more easily 
interpretable and thus an interesting alternative. For this data, we train on the first 160 
spectra and use the rest as validation. We examine the prediction accuracy of the estimated 
model using mean squared error on the validation data, defined as 

1 215 

MSE=-Y,(X-Y<) 2 - 

8=161 
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(a) 

Figure 2: The spectrometric curves. 

With the smoothing parameters selected by GCV, the relevant predictors are found to be 
the 1st and 2nd derivatives of the spectra curves, achieving an MSE of 8.31. Figure [3] clearly 
shows the ability of the estimated model to predict the responses. For comparison, we also 
computed the nonparametric kernel regression using the funopare. kernel. cv function pro- 
vided in the npfda package (it uses cross-validation to select the bandwidth) , which gives a 
smaller MSE of 5.37. However, when using functional linear modeling, unlike kernel regres- 
sion, we can visually examine the features of the functional coefficients for interpretation. 
For example, from Figure HI higher fat content is seen to be related to higher values around 
point 160 and lower values around point 215 in the first derivative, as well as lower values 
around point 190 in the second derivative. 

4 Concluding Remarks 

We propose in this article a regularization method for shrinkage estimation of multiple 
functional linear regression models. We have shown that the proposed method is consistent 
in estimation and variable selection. A computational algorithm based on loc al quadratic 



appr oximation is proposed. It is also possible to use local linear approximation (jZou and Li . 



2008) and our choice made here is based on ease of implementation since closed form so 



lution exists for each iteration. Our simulation results demonstrated the effectiveness of 
the method and the application to spectrometries data provides an interesting alternative 
perspective to the previously used kernel regression on this data. 

We would like to finish this paper by discussing some possible topics for future study. 
One possibility is to consider partially functional linear regression where scalar covariates are 
considered simultaneously. Variable selection can be applied to both the functional and non- 
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10 20 30 40 50 



(a) 

Figure 3: Prediction accuracy with penalized multiple functional regression on 55 validation 
samples. 



functional part. Another direction is to consider multiple functional linear regression when 
the number of predictors diverges with sample size. How to e xtend the shrinkage estimation 
results to generalized fu nctional linear model ( James . 2002I ; Muller and Stadtmuller . 2005 ; 
Cardot and Sardal . l2005h is another interesting topic for further study. 



Appendix 

The following two lemmas study some properties of the estimated principal component 
scores and are important for the proof of Theorem [U Throughout the appendix, we follow 
the notations and assumptions in the main text. 



Lemma 1 We have \£ijk — £ijk\ 
O p (K a+1 /^i). 



O p {K a+1 /y/n) and \ Y%=\ ZijikiZijite /n 



\3l,32 I 

A ki,k 2 * 



Proof. Since ^ ijk = J (-Xy - 

tdl 2 + \\d)jh - d)jh 
Hall and Horowitzl ((20071 ). 



0J\\Xi 



m 



Hj)(j)jk and £ijk = J(Xij - Xj)^ jk , we have \i ijk - djk\ 2 = 
| 2 ) = O p (K 2a+2 /n), using assumption (c2) and equation (5.2) 
For the second part, we have 



n A k u k 2 



ij 2 k 2 



+ 



n 



n 



$ij 2 k 2 



n 



A ki,k 2 
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(a) 



(b) 





(c) 



(d) 



Figure 4: (a) and (b): 1st and 2nd derivative of the spectrometric data. Only 50 samples 
are shown in the figure, (c) and (d): Estimated functional linear coefficient corresponding 
to 1st and 2nd derivative curves respectively, with 95% pointwise confidence interval shown 
as dotted lines. 
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Obviously the second term is of order O p (n 1 ' 2 ). The first term above is further decomposed 



as 



1 " 

(J) = 



j=l 



Using l^jjfel = O p (l) since - CijifeJ = p(!)> we nave a bound O p (if a+1 /\/rc) for (I), 

and the proof is completed. □ 

Lemma 2 For any A denoting a subset of {1, 2, . . . ,p}, let Za be the columns of Z corre- 
sponding to those predictors in A, and similarly let Aa be the submatrix of A corresponding 
to the predictors in A, then the minimum eigenvalue of Z^Za/ti is lower bounded by a 
constant multiple of K~ a , i.e. p m i n (Z^ZA/n) = ft p (K~ a ). 

Proof. We will use || • || to denote also the operator norm of a matrix, and use || • ||i for 
maximum row sum of a matrix. Since \p r ^ m {Z\Z aI n) — p m in(AA)| < ll-^A^/ n ~~ ^11 — 
WZ^Za/tt, — A^Hi = O p (K a+2 1 \fn) using the previous lemma. This together with assump- 
tion (c5) implies the statement of the lemma. □ 

Proof of Theorem [TJ. In the proof we denote the minimum eigenvalue of Z T Z /n by 
p* and thus p* = Q, p {K~ a ) by Lemma [2j The true functional coefficients are denoted by 
Pi = Efc bjk<Pjk- Then 

> J(b)-J(b) 

= llY-yi-ZSl^-llY-Yl-Zftl^ + n^pAGfelD-n^pACI^II) 

i j 

= ||Y-yi-Z6 + Z6-Z6|| 2 -||Y-yi-Z6|| 2 + n^ PA (||6 J ||)-n^p A (||6 i ||) 

3 j 

= 2 (Y-Fl-Z6) r Z(6-6) + ||Z(6-6)|| 2 + n^p A (||6 J ||)-n^p A (||5,||). 



Let r/ = Z(Z T Z) 1 Z T (Y — Yl — Zb) be the projection of Y — Yl — Zb onto the columns 
of Z, LemmaObelow shows that \\t]\\ 2 = O p (r 2 ), where r 2 = O p (K 2a+3 + nK~ a ~ 2l3+l ). By 
assumption (c3), we get nK- Q " 2/3+1 = 0(K 2a+3 ) and thus r 2 = 0(K 2a+3 ) and the above 
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displayed equation can be continued as 



> -O p (r n )||Z(6-6)|| + ||Z(6-6)|| 2 +n^p A (||6 J ||)-n^p A (||6 i | 



> -o p {ii) 



l\\Hb 



+ ||Z(6-5)|| 2 +n^p A (||6,||)-n^ PA (||6 J ||) (9) 



> -O p (rl)+np*\\b-b\\ 2 -n\J2\\k 



■j-h\ 



> -O p (r 2 n ) + np*\\b-b\\ 2 - 2 -^-^f\\b-b\\ 2 , 

where we used Cauchy-Schwartz inequality on the second line, the property \p\(a) —p\(b)\ < 
\\a — b\ on the third line, and Cauchy-Schwartz inequality again on the last line. Thus 
||S - b\\ 2 = O p (^p + jjgp) = op(l) by assumptions (c3) and (c4). 

The convergence rate for ||6 — 6|| 2 can be improved to O p (r 2 /np*), which is useful in the 



proof of part (b). Since ||6 — 6|| = o p (l) and A — > 0, we have P(p 



■a(INI)=pa(INI),i< 

j < s) — > 1 and thus J2j Pa (I \bj\ |) — J2j Pa (I \bj\ I) > with probability converging to 1. This 
combined with ([9]) gives ||6 — 6|| 2 = O p (r 2 /np*). 

From \ \bj — bj\\ = o p (l), part (a) is easily shown using the following decomposition 



%-&|| 2 < 3116,-6,1 



A 



k=l 



K 



+ 3 E b % 

k=K+l 

CO 



3\\b j -b j \\ 2 + 3Kj2b 2 jk \\hk-^k\\ 2 + ^ E b h 



k=l 



k=K+l 



SWbj-bj 



K 2a+2 °° 



ii 



k=K+l 



and obviously all the terms in the above display converge to zero. 

Now we prove part (b) of the theorem. Let 6 = (bf , . . . , bj, 0, . . . , 0) T , that is, 6 is 
obtained from 6 by constraining the truly irrelevant components to be zero. By similar 



15 



arguments for the proof of part (a), we have 
> J(b)-J(b*) 

= 2(Y -Yl- Zb*) T Z(b - 6*) + \\Z{b - b*)\\ 2 + n^Px(\\h\\) ~ "5>A(ll&ill) 

i j 

> -O p (||?f||)||£(S-S*)|| + n jr p x (\\bj\\) 

j=s+l 

> -O p {\\rf\\)yfr £ WkW + nX £ \\bj\\, (10) 

j=s+l j=s+l 

where rf = Z(Z T Z)- 1 Z 1 '(Y -Yl- Zb*). In the last line above we use the fact that ||Sjjj = 
O p (r n /^np*) = o p (A) when j > s (from the proof of part (a)) and thus Pa(II^'H) = ^ll^'ll- 
We bound \ \r]*\\ as follows. 

||?7*|| 2 < 2\\ V \\ 2 + 2\\Z(b* -b)\\ 2 

= O p (r 2 n ) + O p (nr 2 /(np*)) = O p (r 2 /p*). 

Since we have that O p (||«*ll) = o p (y/nX), we will have a contradiction in (fTU|) '^Y^= s +i > 
0. □ 



Lemma 3 Let rj = Z{Z T Z) l Z T {Y — Yl — Zb) as in the proof of Theorem^ then \ \n\ \ 2 = 
O p {rl) where r 2 n = K 2a+3 + nK~ a - 2fi + x . 

Proof. Denote by Z the matrix similar in structure to Z but contains the true principal 
component scores instead of We have the decomposition 

Y -Yl- Zb = e + (fi-Y)l + (Z - Z)b + u, (11) 

where e = (ei, . . . , e n ) T and v is a re-dimensional vector with i-th component given by 



P OO 

3=1 k=K+l 



Let Pg = Z{Z T Z) l Z T . Now r\ = Pg(Y — Yl — Zb) is the projection of the four terms in 
the decomposition (jlip onto columns of Z, and we bound each term in turn below. 

Since ||P^e|| 2 = e T P^e, using the fact E[e T P^e\X] = a 2 tr{P^) = a 2 pK, Var{e T P^e\X) = 
2aHr(Pl) + (Eei-3a 2 ) Y?_, (P ,)% < 2aHr(Pl) +\Ee'-2,a 2 \ E]=i( P z)jj = P {K) (see for 
example equations (3.3), (3.4) in lHuang and Fanl (Il999l )). where (Pz)jj are the diagonal ele- 
ments of Pg which are all no larger than 1 since P% is a projection matrix. Using the equal- 
ities Ee T P^e = E[E(e T P^e\X)} and Var{e T P^e) = E[Var(e T P 2 e\X)} + Var(E[e T P^e\X}), 
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we have 

\\P^e\\ 2 = O p (K). (12) 

Besides, 

\\Pz{Z - z)b\\ 2 < \\{Z-Z)b\\ 2 = 0{\\(Z-Z) T {Z-Z)\\). 
Using Lemma QJ we get \\{Z - Z) T {Z - Z)\ \ < \ \(Z - Z) 7 \Z - Z)\\ x = O p (K 2a+3 ) and thus 

\\P 2 (Z - Z)b\\ 2 = O p (K 2a+3 ). (13) 

Finally, 

oo oo 

k=K+l k=K+l 

oo 

= 0( k~ a k- 2 P) 
k=K+l 

= 0(K~ a ~ 2 ? +1 ) 

Since the number of predictors p is fixed, we have Var{ui) = 0(K- a - 2 ^ +1 ) and thus 

\\u\\ 2 = O p {nK- a - 2l3+1 ). (14) 

Combining CEJ), (USD, CE]) as well as \n - Y | = O p {n- l l 2 ), we get ||r?|| 2 = O p (r 2 ). □ 

Proof of Proposition [T]. From equation ©, the Karhunen-Loeve expansion of the 
predictors are given by 

oo I 

x i(t) = Ydk4>k(t), with £ ik = Y^aijbjji^i = 1,2. 

k=l j=l 

Using the notation in the main text, we have that the general entries of A are given by 

\«l,*2 — jpt C — j 'Ylij=l a iij a i2j K jk k\ = ki = k 



Thus in this case, in the block matrix form, 



A - i vT 



E F 
F T G 



and the matrix F is also diagonal. Since A is similar to the matrix 

, E 

A 1 G-F T E- 1 F 
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the eigenvalues of A are just the diagonal elements of E and G — F T E X F. The eigenvalues 
of E are £l(K~~ a ) by assumption, and the diagonal elements of G — F T E~ 1 F are 

E 2 (Ej=i aija2jKjk) 2 

a 2 j K jk =^ 2 

3=1 A?=l a lj K i k 

^l<jl^2<l^ ai ^ a2 hV K hk K j2k ~ a 2j 1 aij 2 \/ K jikKj 2 k) 2 



c T t i<j 1 ^j 2 <i( a iji a 2j 2 - a2j 1 aij 2 ) k 



2h,-2a 



□ 
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